Section:
New Results
Combinatorics of motifs and algorithms
We developed an -time and -space algorithm to compute
minimal absent words. Their computation is used in sequence comparison [32]
or to detect biologically significant events.
For instance, it was shown in [52] that there
exist three minimal words in Ebola virus genomes
that are absent from the human genome. Identifying
such species-specific sequences may prove useful for developing both
diagnostics and therapeutics.
In our new contribution [21], we provided an implementation that
can be executed in parallel. Experimental results show that, excluding
the construction time of the indexing data structure, it achieves
near-optimal speed-ups.
The computation on the human genome is accelerated by a factor of 10
when using 16 processors, but it consumes a huge amount of RAM.
We are therefore working on an external-memory implementation
that will provide a trade-off between space and time consumption.
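
To make the notion concrete, the following is a minimal brute-force sketch in Python, not the algorithm of [21]. It assumes the usual definition: a word w of length at least 2 is a minimal absent word of a sequence if w does not occur in it, while both w without its last letter and w without its first letter do occur. The function name minimal_absent_words and the max_len parameter are hypothetical, introduced only for illustration. The sketch enumerates all factors explicitly, so it is only suitable for very short sequences.

```python
def minimal_absent_words(seq, alphabet=None, max_len=None):
    """Return the set of minimal absent words (length >= 2) of seq.

    Brute-force illustration only: all factors are stored in a set,
    which is quadratic in len(seq) and unusable on real genomes.
    """
    if alphabet is None:
        # Assumption: restrict to the letters that actually occur in seq.
        alphabet = sorted(set(seq))
    n = len(seq)
    if max_len is None:
        # A minimal absent word has length at most n + 1.
        max_len = n + 1

    # All non-empty factors of seq of length at most max_len.
    factors = {
        seq[i:j]
        for i in range(n)
        for j in range(i + 1, min(n, i + max_len) + 1)
    }

    maws = set()
    for u in factors:
        if len(u) >= max_len:
            continue
        for b in alphabet:
            w = u + b
            # w is absent, its longest proper prefix u occurs by construction,
            # and its longest proper suffix w[1:] must occur as well.
            if w not in factors and w[1:] in factors:
                maws.add(w)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
```

On the toy sequence abaab over the alphabet {a, b}, this yields aaa, aaba, bab and bb. Every absent word of that sequence contains at least one of these four as a factor.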
Combinatorial tools have been developed to predict the length of
repetitions in a random sequence. This makes it possible to distinguish
biologically significant repetitions or to tune parameters in
assembly or re-sequencing algorithms. For instance, unique mappability
is strongly related to the length of the repetitions. A trie
profile was defined in [45] to address this
issue for binary alphabets by means of analytic combinatorics.
General alphabets, for which no closed formula exists, were addressed in
[24]. An alternative and simpler approach is
derived that exhibits a large deviation principle and makes use of
Lagrange multipliers. Different domains and phase transitions are
exhibited. This approach is expected to extend to a Markov model
and to approximate repetitions.
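
As a rough illustration of what predicting repetition lengths means, the sketch below measures the longest repeated substring of a simulated random sequence. It compares that length with the classical first-order estimate 2 log n / log s for a uniform, independent model over an alphabet of size s. This estimate is a textbook heuristic, not the refined result of [24] or [45], and the function names are hypothetical.

```python
import math
import random


def longest_repeat_length(seq):
    """Length of the longest substring occurring at least twice in seq
    (occurrences may overlap). Simple k-mer set approach with a binary
    search on the length; fine for sequences of moderate size."""

    def has_repeat(k):
        seen = set()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in seen:
                return True
            seen.add(kmer)
        return False

    # If a repeat of length k exists, repeats of every shorter length exist,
    # so the predicate is monotone and binary search applies.
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def first_order_estimate(n, alphabet_size):
    """Heuristic 2 * log(n) / log(alphabet_size), assuming a uniform
    i.i.d. sequence; an assumption of this sketch, not a result of [24]."""
    return 2 * math.log(n) / math.log(alphabet_size)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are reproducible
    n = 20000
    seq = "".join(rng.choice("ACGT") for _ in range(n))
    print("observed longest repeat:", longest_repeat_length(seq))
    print("first-order estimate   :", round(first_order_estimate(n, 4), 2))
```

A repetition much longer than this kind of estimate is a candidate for biological significance. Conversely, reads shorter than the typical repetition length cannot be expected to map uniquely.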